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Linear systems solvers - recent developments and implications for lattice 
computations 

A. Frommer'* 

'^Fachbereich Mathematik, Universitat Wuppertal, 
42097 Wuppertal, Germany 

We review the numerical analysis' understanding of Krylov subspace methods for solving (non-hermitian) 
systems of equations and discuss its implications for lattice gauge theory computations using the example of the 
Wilson fermion matrix. Our thesis is that mature methods like QMR, BiCGStab or restarted GMRES are close to 
optimal for the Wilson fermion matrix. Consequently, preconditioning appears to be the crucial issue for further 
improvements. 



1. KRYLOV SUBSPACE METHODS 



Given a linear system of equations 
Mx = b 



(1) 



with M S C"^" being non-singular, the class of 
Krylov subspace iterative methods for solving (|^) 
is characterized by the following generic template 



choose initial guess x", set r'^ — 
for m — 1,2, . . . until convergence 
compute iterate of the form 



L(A/)rO 



Here, Qm-i is a polynomial of degree < to — 1. 
For the residual r'" — b — AIx™ we therefore get 



,(M)r", 



(2) 



where is the polynomial p„j (i) = 1 — 
In an algorithmic description of virtually any 
Krylov subspace method, the polynomials qm-i 
or pm are not explicitly present, but they are cru- 
cial to a theoretical analysis of the method. More- 
over, the relation (^) is also the key to under- 
standing the condition ('difficulty') of the linear 
system to be solved and we start by discussing 
this point. 

1.1. Condition 

Assume that M is diagonaizable, i.e. we have a 
decomposition of the form 



where A is diagonal with its diagonal contain- 
ing the eigenvalues, and V is the correspond- 
ing matrix of (right) eigenvectors. Denoting the 
spectrum of M by a{M), from (||) we now get 



\\n\<\\V\\-\\V-'\\-\\r'\\-\\p^{A)\ 
Since Pm(A) is diagonal we have 



(3) 



lbm(A)|| = max |pm(A) 
Ae<T(A/) 



Considering || and ||r'^|| as constants, 

the best bound any Krylov subspace method can 
achieve in (||) is the one obtained for the polyno- 
mial which minimizes ||pm(A)||. In this sense, the 
quantities 



mm 

g(p„)<T„ 
P,Tl(0) = l 



max |n™(A)|, to ^ 

\£a(M) 



0,1,... (4) 



represent a measure of the condition of the system 
(0) , since no Krylov subspace method can achieve 
a better bound in (||) than the one which replaces 
|pm(A)|| by Cm- Finding the optimal polynomial 
in (||) is a complex approximation problem for 
which solutions are known only in special cases. 
However, it is clear that due to the restriction 
Pm(0) = 1 the numbers Cm will tend to zero only 
slowly if there are many eigenvalues close to 0, 
particularly if they are distributed quite evenly 
around 0. 
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1.2. Optimal methods 

A Krylov subspace method is feasible algo- 
rithmically if it requires only a finite amount of 
ressources like storage and computer time. We 
express this fact by saying that the method can 
be implemented using short recurrencies, mean- 
ing that all quantities needed at iteration m can 
be computed from those of a small number of pre- 
vious iterations. Note that each Krylov subspace 
method will require at least one multiplication 
with M per iteration to account for the fact that 
the degree of the polynomial Pm will increase as 
the iteration proceeds. The following theorem 
shows that optimality and short recurrencies can 
only be achieved for a restricted class of matrices. 

Theorem 1.1 A Krylov subspace method which 
achieves optimality, i.e. 

||r™||= min |b(M)r°|| (5) 

Pm(0) = l 

for every initial residual r^ and which can be im- 
plemented using short recurrencies exists only if 
M is of the form 

M = e*®(r + iai), where T ^ T'' and cr, 9 £ IR. 

This theorem also holds if || ■ || is replaced by 
an energy norm of the form ||x||^f — x^Hx 
with H £ £rixn ]^g]-jjjj^;ian and positive definite. 
For M hermitian and positive definite the CG 
method achieves optimality in the energy norm 
with H — M . For M hermitian (but possibly in- 
definite) , MINRES P is optimal in the Euclidian 
norm. The paper |^] gives algorithmic descrip- 
tions for optimal methods in the other cases of 
Theorem 1.1. Note that the above theorem in- 
cludes matrices of the form al + S with 5*^ = — 5* 
(take 8 = — 7r/4), arising for staggered fermions. 

2. NON-HERMITIAN SYSTEMS 

The last 10 years have seen tremendous 
progress in Krylov subspace methods for solv- 
ing linear systems which, like the Wilson fermion 
matrix, do not fall into the category covered by 
Theorem 1.1. See |,| for an overview. For sim- 
plicity, such systems will just be termed 'non- 
hermitian' in the sequel. In these cases one must 



find an adequate compromise between the qual- 
ity of the Krylov subspace method to use and the 
ressources required by the method. 

The first method of this kind is the BiCG 
method Here, an additional shadow residual 
r is selected and the m-th iterate is defined 
by the Galerkin condition 

(r")tp„(M)f = 

for all polynomials Pm of degree < m. In case that 
M is hermitian positive definite and f = r*^ the 
method reduces to the CG method. BiCG needs 
two matrix multiplies (one with M and one with 
M^) per iteration and the residuals typically un- 
dergo quite large variations. Moreover, there are 
situations where the method breaks down (due to 
division by zero) without having reached a solu- 
tion. Although exact breakdowns do rarely occur 
in practice, near breakdowns severely affect the 
numerical stability. 

2.1. QMR 

QMR, the quasi minimal residual method of 
0, can be regarded as one way to make BiCG 
more reliable. As BiCG it is based upon the 
non-symmetric Lanczos process to compute an 
appropriate basis vi, . . . , Vm of the Krylov sub- 
space Km{M,r°) = {pi{M)r°, degpi < to - 1}. 
The TO-th residual r™ is characterized by the co- 
efficient vector («!,...,«„) in r™ = Y^'iLi'^i'^i 
having minimal norm subject to the condition 
= pm{My,degpm < TO,Pm(0) = 1. If 
the Lanczos vectors ui , . . . , Vm were orthogonal 
this would imply that r™ is minimal. Since for 
non-hermitian matrices the Lanczos vectors are 
not orthogonal, minimizing the coefficient vec- 
tor merely implies a 'quasi' minimality of 
whence the name QMR. QMR eliminates one 
source of breakdowns present in BiCG. Moreover, 
using a look-ahead strategy in the non-symmetric 
Lanczos process, almost all other (exact or near) 
breakdowns are also avoided at the price of ex- 
tra storage. All these features are implemented 
in QMRPACK which is available from netlib. As 
in BiCG each iteration costs one multiply with 
M and one with M^. The quite smooth conver- 
gence of QMR is also justified by the theoretical 
analysis. 
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2.2. J-hermitian matrices 

A matrix M is said to be J-hermitian if there 
exists a matrix J such that 

MJ = JM^ . 

In this particular case, the non-symmetric Lanc- 
zos process can be made less costly, since through 
the right choice of the 'shadow residual' r all mul- 
tiplications with Aft can be replaced by multi- 
plications with J 1^. Consequently, BiCG and 
QMR require only one multiply with M and one 
with J in each iteration. For the Wilson fermion 
matrix we have J = 75 and thus multiplies with J 
are by far more cheaper than with M . Exploiting 
the 75-symmetry thus makes QMR (and BiCG) 
competitive to the other methods discussed in 
this section, see (|Jl^. At the time of writing 
this article, including the J-hermitian case into 
QMRPACK was under preparation but not 
yet completed. 

2.3. BiCGStab 

The BiCGStab method is another way to 
stabilize BiCG. Here, multiplies with are re- 
placed by multiplies with M such that an ad- 
ditional one-dimensional minimization process is 
performed during each iteration. 

All computational effort, in particular, all ma- 
trix multiplies is spent working on the iterates 
of the system to solve. Typically, BiCGStab 
produces less varying residuals than BiCG, al- 
though the same sources for breakdowns are still 
present. BiCGStab is quite easy to implement 
'from scratch'. Some variations are described in 

2.4. Restarted GMRES 

In contrast to the Lanzcos process, the 
Arnoldi process computes an orthogonal basis of 
Km{M,r^) for a general non-hermitian matrix 
M . From the Arnoldi basis it is possible to calcu- 
late an optimal iterate (such that r™ satisfies 
(H) ) by solving a small least squares problem. 

The resulting method is called GMRES, the 
generalized minimal residual method How- 
ever, the Arnoldi process does not rely on short 
recurrencies requiring m vectors of storage and 
O(m^) inner products to be computed. 



One therefore has to stop GMRES after a cer- 
tain number {k, say) of iterations and restart 
the process with the current iterate as a new 
initial guess. The resulting method is termed 
restarted GMRES or GMRES(fc). For fc = 1, a 
restart is done after every iteration. Hence, GM- 
RES (1) is identical to the familiar MR method 
|[T^], where the iterate x"^~^^ is obtained by mini- 
mizing r"^^^{t) = h — M (x™ + tr™) with respect 
to t G C. There are situations where GMRES(fc) 
stagnates without reaching a solution, even for 
large restart values k, but if all eigenvalues of M 
lie in the right half plane GMRES (A:) is known to 
converge for all k P,p^. 

3. PRECONDITIONING 

We have seen in Section 1 that the eigenvalue 
distribution of M determines a bound on the 
maximal speed of any Krylov subspace method 
for M. Once we have a method which is close 
to optimal, the only way of getting further im- 
provement is to change the matrix M to one for 
which the eigenvalue distribution is more favor- 
able. This is precisely the purpose of precon- 
ditioning where the original system Mx = 6 is 
changed to 

V^^MV^^x^b (6) 

with b = V{~^b and x = V2X. The matrices Vi, V2 
are called the left and right preconditioner, resp., 
and their product ^ = V1V2 is often referred to 
as the preconditioner. Note that the spectrum of 
V^^MV^^ is identical to that of V~^M, so that 
the effect of preconditioning on the eigenvalue dis- 
tribution is determined by V alone but not by its 
factorization V = ViV2- A preconditioner should 
approximate M (so that the eigenvalues of V^^M 
cluster around 1). On the other hand, perform- 
ing a Krylov subspace method on the precondi- 
tioned system requires multiplies with the precon- 
ditioned matrix like in z = V^^MV^^y which are 
normally obtained via 

solve V2W = y, V — Mw, solve Viz — v. 

Preconditioning thus introduces additional solves 
with the matrices Vi and V2 and this overhead 
should not be too expensive in order to get an 



4 



efficient metiiod. A good preconditioner is always 
a compromise between the latter requirement and 
the fact that V should well approximate M. 

Conceptually, one may distinguish between two 
types of preconditioners: In problem oriented pre- 
conditioners the matrix V is taken as a simpler or 
reduced (with respect to M) representation of the 
underlying physical problem. Algebraic precondi- 
tioners are obtained directly from M without re- 
course to the application from which M arises. 
Interestingly, algebraic preconditioners seem to 
be more successful than problem oriented ones 
in QCD computations and we therefore focus on 
the latter ones. 

3.1. SSOR preconditioners 

Each matrix M can be decomposed into 

M = D - L-U, 

where D,—L and —U ^ represent the diag- 

onal, the stricly lower and the stricly upper trian- 
gular part of M. We assume that M has all diag- 
onal elements ^ 0, so that D, D — L and D — U are 
all non-singular. For a given relaxation parame- 
ter 7^ the SSOR preconditioner is defined by 
(see e.g.) 

For oj — 1 we thus have V — M — LD~^U as an 
approximation to M. Systems with the precon- 
ditioner V are easy to solve because ~ L and 
— U are triangular so that x in D — L)x = 
y can be obtained by a simple forward recur- 
sion, and similarly by a backward recursion in 
(il? — U)x = y. Note that the situation becomes 
more involved if we consider parallelization issues 
since recursions are known to parallelize badly. 
Assume that AI is of the particular form 

"-{-% t)^ 

This is the case for the Wilson fermion matrix 
if we use the standard odd-even ordering (with 
Di= D2 = I). If we take Vi ^ {D - L)D'^ and 
V2 = (£> - U) we get 



For the Wilson fermion matrix the second diago- 
nal block in (0) is commonly called the odd-even 
reduced system. Our discussion shows that odd- 
even reduction is nothing else but the SSOR pre- 
conditioning with respect to the odd-even order- 
ing and with lu ~ I. Very exceptionally, in this 
case it is of no harm to calculate the precondi- 
tioned matrix explicitly as done in (|^), whereas in 
general this produces too much fill-in to be prac- 
ticable. If we re-interprete D, —L, —U as block 
parts of M, the above discussion can also be used 
to derive block SSOR preconditioners. In QCD 
this can be useful in the context of improved ac- 
tions where D then is block diagonal with blocks 
of size 12 X 12. See also 0Jl| 

3.2. ILU factorizations 

The incomplete L U factorization (ILU) (see ||^ , 
e.g.) is another algebraic method to obtain a 
preconditioner V = - L)D-\D - U) for M 
where, again, D, L, U are diagonal, strictly lower 
and strictly upper triangular, respectively. These 
matrices are obtained by performing a variant of 
Gaussian elimination on M imposing restrictions 
on the amount of fill-in in the factors D — L and 
D — U so that V represents only an approximate 
(incomplete) factorization of AI. If we allow for 
no fill-in (i.e. D — L and D — U have the same 
sparsity structure as M) and if M represents a 
nearest neighbor-coupling on a regular grid, then 
L = L and U — U, so that the only difference to 
the SSOR preconditioner resides in the diagonal 
part D. For the Wilson fermion matrix with Wil- 
son parameter r = 1 both preconditioners turn 
out to be totally equal. ILU preconditioners are 
often somewhat more efficient than SSOR precon- 
ditioners, but note that they require a start-up 
phase to compute D (and L and U, in general). 

3.3. The Eisenstat trick 

If we have an SSOR or ILU preconditioner of 
the form Vi = - L)D-^ and V2 ^ {D -U), 
the product y — V-^^ALV.^^x can be computed 
as 

solve {D — U)v = x, 

solve {b - L)w ^{D- 2D)v + x, 

y — D{v + w). 
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As far as flop counts are concerned, the above 
scheme is as expensive as one multipHcation with 
M itself, except for some additional operations 
involving diagonal matrices which can usually be 
neglected. So, due to the Eisenstat trick, the ILU 
and SSOR preconditioners do not increase the 
amount of work per iteration, thus making these 
preconditioners particularly attractive. Note that 
the Eisenstat trick can also be applied in more 
general situations, see p9| . 

3.4. The influence of orderings 

When writing down the equation Mx = b we 
are free to chose any ordering for the variables, 
and the change from one ordering to another 
translates into a transformation of the kind M — > 

MP with P a permutation matrix. For both, 
the SSOR and ILU preconditioners, the spectrum 
of the preconditioned matrix depends on the or- 
dering chosen (but the Eisenstat trick can always 
be applied). There is therefore a potential to op- 
timize these preconditioners using the best order- 
ing. Typically, orderings which yield good pre- 
conditioners make the recurrencies in solving the 
triangular systems less amenable to parallel im- 
plementations. For example, the natural lexico- 
graphic ordering of lattice points in the Wilson 
fermion matrix was shown to yield a high quality 
ILU preconditioncr , but it cannot be handled 
efficiently on a distributed memory parallel com- 
puter. In [in it was shown that a new locally 
lexicographic ordering can yield up to a factor 2 
improvement over odd-even preconditioning on a 
Quadrics parallel computer. 

3.5. Polynomial preconditioning 

Another algebraic preconditioncr is obtained 
by taking — s{M) where s is a polynomial 
such that s{M) approximates M~^. So the multi- 
plication with requires deg(s) multiplies with 
M. Consequently, def;(s)-t-l steps on the original 
system are as expensive as one step on the pre- 
conditioned system and the iterates are from the 
same Krylov subspace. In this respect polynomial 
preconditioning therefore offers little advantage, 
but it was shown in ||l^] that it can be useful as 
a mean of stabilizing the MR method in certain 
situations. 



4. EXAMPLE: WILSON FERMIONS 

The generic form of the Wilson fermion matrix 
is M = / — kB where B represents the nearest 
neighbor coupling on the space-time lattice. Tak- 
ing the even-odd ordering, B has the form 

^-{b. »)■ 

The odd-even reduced matrix from is 
Me =1 -K^ ■ B2B1. 

A typical example of the eigenvalue distribution 
for M and M^ (calculated from a confined con- 
figuration on a small 4'*-lattice at /3 = 5.0 and 
K = 0.150) is given on top of Fig. 1. Note that 
all eigenvalues lie in the right half plane so that 
GMRES(fc) is known to converge for all k. A 
number /i is an eigenvalue of Me if and only if it 
is of the form /i = A(2 — A) where A is an eigen- 
value of M. We write Qe > for the smallest real 
part of an eigenvalue of Me ■ 

Both, M and Me are 75-symmetric, and we de- 
note the respective symmetrized systems by 

Q = 75 • M, Qe=75- Me. 

Q and Qe are both hermitian and half of 
their eigenvalues are negative and half are pos- 
itive. Moreover, the eigenvalue plots given in 
Fig. 1 show that except for a pair close to zero 
the eigenvalues are quite evenly distributed in 
two intervals symmetric to the origin, denoted 
[— 6e,— Oe], [ae,be] for Qe- Finally, if we consider 
MjMe = Ql, then its eigenvalues are just the 
squares of those of Qe and are therefore dis- 
tributed in the interval [a^ , &g] . 

With the information of Fig. 1 as a background, 
we can now start to discuss the condition of the 
different matrices, i.e. the numbers c™ from (^. 
First of all we realize that odd-even precondition- 
ing really makes the spectrum of M and Q 'nicer', 
since eigenvalues are mapped away from and 
are more clustered. We thus restrict the subse- 
quent discussion to the even-odd preconditioned 
matrices. For the hermitian matrices Qe and Qe, 
rather good bounds for c,„ can be derived via the 
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Figure 1. Spectra of Af, Me, Q, Qe for a 4* lattice 
at /3 = 5.0 and k = 0.150. The inlays for Q, Qf. 
represent a zoom to the eigenvalues close to 0. 



Chebyshev polynomials on [og, 5e], and we obtain 

C™(Qe)<(^^) ,C™(g2)< 



1 



1 + ff 



1 + ff 



Since MINRES is a feasible optimal method for 
Qe and CG is an optimal method for Q\ = 
AfjMe, we also can take the above numbers 
as an approximate measure for the performance 
of these methods. They indicate that CGNR, 
the CG method applied to the normal equations 
M'^Mx ^ would require half as many itera- 
tions as MINRES for Qx = 756. So in terms of 
matrix multiplies with Q (or M) - which is the 
computationally dominating part -, both meth- 
ods should be comparable. Fig. 2 gives some ex- 
perimental data, where we show the convergence 
history of MINRES and CGNR plotting the resid- 
ual norm against the number of matrix mulit- 
plies. We see that MINRES actually performs 
somewhat better than CGNR. The data comes 
from a confined configuration on a 16* lattice at 
/3 = 6.0 and k — 0.155 which yields a relative 
quark mass of 0.02, approx. (In order to observe 
substantial differences between different methods 
it is important to work on 'difficult' problems, i.e. 
with small relative quark masses.) 

For the non-hermitian matrix Mg it is not pos- 
sible to give an accurate bound on Cm, but we 
know at least that 



C™(A/e) < 



1-^ 
62 



and MR (== GMRES(l)) aheady achieves llr"]] < 
B0- "^^^ remaining parts of Fig. 2 give 
the convergence history for GMRES(fc) for several 
values of k for the same configuration as before as 
well as the corresponding results for BiCG, QMR 
and BiCGStab. In BiCG and QMR we made use 
of the savings due to 75-symmetry. One immedi- 
ately notices the more erratic behavior of BiCG 
and BiCGStab. We also see that increasing k 
in GMRES(A:) gives significant improvement, but 
there seems little use taking k larger than 8. Fi- 
nally, QMR, BiCG, BiCGStab perform best and 
quite comparably which we can interprete as an 
indication that they are all close to optimal for 
Me- This observation is backed by results from 
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|22| proving that even the full GMRES method 
did not give substantial improvement over BiCG, 
QMR or BiCGStab. 
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